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Abstract 

We propose a novel general algorithm LHAC that efficiently uses second-order 
information to train a class of large-scale ^-regularized problems. Our method ex- 
ecutes cheap iterations while achieving fast local convergence rate by exploiting the 
special structure of a low-rank matrix, constructed via quasi-Newton approximation 
of the Hessian of the smooth loss function. A greedy active-set strategy based on 
the largest violations in the dual constraints, is employed to maintain a working set 
that iteratively estimates the complement of the optimal active set. This allows for 
smaller size of subproblems and eventually identifies the optimal active set. Empirical 
comparisons confirm that LHAC is highly competitive with several recently proposed 
state-of-the-art specialized solvers for sparse logistic regression and sparse inverse co- 
variance matrix selection. 



1 Introduction 

We consider convex sparse unconstrained minimization problem of the following general 
form 



min F(w) — A||k;||i + L(w) 



(i) 



where L : W — » M. is convex and twice differentiate and A > is the regularization 
parameter that controls the sparsity of w. More generally, the regularization term A||it;||i 
can be extended to ||A o w\\i = Ya=i ^i\ w i\ to allow for different regularization weights 
on different entries, e.g., when there is a certain sparsity pattern desired in the solution 
w. We will focus on the simpler form as in ([I]) in this work for the sake of simplicity of 
presentation, as the extension to the general form is straightforward. 

Problems of form have been the focus of much research lately in the fields of signal 
processing and machine learning. This form encompasses a variety of machine learning 



models, in which feature selection is desirable, such as sparse logistic regression Yuan 



et al. 20101 


2011 


Hsieh et al. 


2011 



Shalev-Shwartz and Tewaril 120091, sparse inverse covariance selection 



Qls en et al.|[2012l , |Scheinberg and Rish|[2009] , Lasso |Tibshimrn] [1996 



etc. These settings often present common difficulties to optimization algorithms due to 
their large scale. During the past decade most optimization effort aimed at these prob- 
lems focused on development of efficient first-order methods, such as accelerated proximal 



gradients methods |Nesterov| |2007| , |Beck and Teboullej [2009] , | Wright et al.| [2009], block 
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coordinate descent methods Yuan et al. 2011] ^ Friedman et al. 2010, 2008 1, Scheinberg and 

[20101 



Rish 1 2009 and alternating directions methods Scheinberg et al. 



These methods 

enjoy low per-iteration complexity, but typically have low local convergence rates. Their 
performance is often hampered by small step sizes. This, of course, has been known about 
first-oder methods for a long time, however, due to the very large size of these problems, 
second order methods are often not a practical alternative. In particular, constructing 
and storing a Hessian matrix, let alone inverting it, is prohibitively expensive for values 
of p larger than 10000, which often makes the use of the Hessian in large-scale problems 
impractical, regardless of the benefits of fast local convergence rate. 

Nevertheless, recently several new methods were proposed for sparse optimization 
which make careful use of second order information Hsieh et al. 2011] ? Yuan et al. |2011 



Olsen et al. 2012| , Byrd et al. |2012| . These methods explore the following special prop- 



erties of the sparse problems: at the optimality many of the elements of w are expected 
to equal 0, hence methods which explore active set-like approaches can benefit from small 
sizes of subproblems. Whenever the subproblems are not small, these new methods exploit 
the idea that the subproblems do not need to be solved accurately. In particular we take 
a note of the following methods. 



Yuan et al. |2011| proposes a specialized GLMNET Friedman et al. 2010] implemen- 



tation for sparse logistic regression, where coordinate descent method is applied to the 
lasso subproblem constructed using the Hessian of L(w) - the smooth component of the 
objective F{w). Two major improvements are discussed to enhance GLMNET for larger 
problem - exploitation of the special structure of the Hessian to reduce the complexity 
of each coordinate step so that it is linear on the number of training instances, and a 
two-level shrinking scheme proposed to focus the minimization on smaller subproblems. 
Hsieh et al.] [2011] later use the similar ideas in their specialized algorithm called QUIC 



for sparse inverse covariance selection. Benefiting from both its active-set strategy, which 
eventually converges to the optimal nonzero subspace, and its efficient use of the Hessian, 
QUIC behaves as Newton-like algorithms and is able to claim quadratic local convergence. 
Another related line of work begins with paper by Wright 20121, which proposes and an- 



alyzes an algorithm that is characterized by a two-phase minimization step for obtaining 
the improving direction, with one phase where a gradient descent step is taken towards 
minimizing the subproblem, and an enhanced phase where a Newton-like step is carried 
out in the nonzero subspace resulted from the first phase. Similar ideas are also explored 
Olsen et al.l 120121, in which a class of orthant-based Newton method is proposes such 



m 

that Newton-like algorithm is applied in an orthant face which lies in a reduced subspace 
and is identified by first taking a steepest descent step. A backtracking line search, how- 
ever, has to be put in place afterwards to project the step back onto the orthant whenever 
the step overshoots. 

The above mentioned methods share similar attributes. In particular, all of them 
incorporate actual Hessian either by confining the subproblem minimization to a smooth 

or by using it along with coordinate descent 
Active-set strategy is another key element 



subspace Wright 2012 , Olsen et al. 2012 



methods |Yuan et al.| [20111, |Hsieh et al.[ 2011 



shared by these approaches, which facilitates the use of the Hessian, often requiring only 
the reduced Hessian rather than the full one, and more importantly, help identify the 
optimal nonzero subspace and eventually achieve (with the Hessian) fast local convergence. 
Unfortunately, however, most of those active-set methods are only able to shrink the size 
of the subspace significantly when the current iterate is close enough to the optimality. 
Some algorithms are aware of this, e.g., Wright |2012| gives up the Hessian and returns to 
first-order steps if the size of the subspace exceeds 500, QUIC uses a small number e (set 



2 



to a constant) to control the subspace size, etc. Hence, the efficient use of second order 
information in large problems is still a challenge. Of the aforementioned methods, the 
ones by Yuan et al. |2011 and Hsieh et al. 201 1| produce the most satisfying results. But 
we note that both are specialized algorithms that heavily depends on the special structure 
present in the Hessians of the corresponding models. Hence their use will be limited. 
In this work we make use of similar ideas as in |Yuan et al.|[20TT] and |Hsieh et!dJ[20lT 



but further improve upon these ideas to obtain efficient general schemes. In particular, we 
use a different working set selection strategy than those used in Hsieh et al. |2011 . We 
choose to select a working set by observing the largest violations in the dual constraints. 
Similar technique has been successfully used in optimization for many decades, for instance 
in Linear Programming [Bixby et al.] |1992| and in S VM |Scheinberg| |2006| . As in QUIC and 
GLMNET we optimize the Lasso subproblems using coordinate descent, but we estimate 



the Hessian using limited memory BFGS method Nocedal and Wright 2006] because 



the low rank form of the Hessian estimates reduce the per iteration complexity of each 
coordinate descent step to a constant. 

Our goal here, as mentioned above, is to achieve second-order type convergence rate 
while maintaining a comparable per iteration complexity with that of most first-order 
methods. Following the approach as in Yuan et aTj |2011| and |Hsieh et al. [2011 , we apply 
coordinate descent methods iteratively to the lasso subproblems constructed at the current 
point. The acceleration of subproblem minimization, therefore, depends on controlling ei- 
ther the number of coordinate descent steps or the complexity of each individual step. The 
contributions of our work are thus twofolds. First of all, we adaptively maintain a working 
set of coordinates with the largest optimality violations, such that the steps we take along 
those coordinates always provide the best objective function value improvement. The 
greedy nature of this approach helps reduce the violation of optimality conditions rather 
aggressively in practice while effectively avoiding zero updates, and more importantly, 
extends/shrinks (depending on the initial point) the working set incrementally until it 
converges to the complement of the optimal active set. Secondly, we explore the use of 
the Hessian and Hessian approximations in the coordinate descent framework. We show 
that each coordinate descent step can be greatly accelerated by the use of a special form 
of limited- memory BFGS method Nocedal and Wright [2006| . For example, in the case of 
sparse logistic regression, given the Hessian of logistic loss L(w) as B = CX T DX where 



D e 



»NxN 



is a diagonal matrix and X E 



hNxp 



is the data matrix, the best implemen- 



x2m and Q E M 2mxp , we are able to 
depending on the limited-memory 



tation so far can only reduce the complexity of each coordinate descent step to O(N) 
flops | Yuan et al. [2011 , while with the help of LBFGS which approximates the Hessian 
by a low rank matrix B — 7/ — QQ where Q E 
bring that complexity down to a constant time 0(m 
parameter m which is often chosen between 5 and 20. The key observation here is that 
the Hessian approximation obtained by LBFGS is low rank unlike the true Hessian, and 
that can be exploited to expedite the computation of (Bd)i, the main expense of every 
coordinate descent step, by letting (Bd)i = 7^ — qfd, where qi is the z-th row vector of 
the matrix Q and d is cached and updated using one column of Q for each step. 

The paper is organized as follows. In Section [2] we explain how the subproblems 
are generated using LBFGS updates and working set selection. In Section [3] we explain 
how coordinate descent is applied to solve the subproblems with low rank Hessians. In 
Section[4]we present computations results on two instances of sparse logistic regression and 
five instances of sparse inverse covariance selection. The results demonstrate significant 
advantage of our approach compared to the other methods which inspired this work. 



3 



2 Outer iterations 



Based on a generalization of the sequential quadratic programming method for nonlinear 
optimization Nocedal and Wright 2006| , Tseng and Yun |2009 , our approach iteratively 
constructs a piecewise quadratic model to be used in the step computation. At iteration 
k the model is obtained by expanding the smooth component L(w) around the current 
iterate Wk and keeping the t\ regularization term, as follows: 



dk = arg min {S7L k d J r d B^d + X\\wk + d\\i} 

d,B k yo 



(2) 



where Bk can be any positive definite matrix 



Tseng and Yun 2009|. In particular, we note 



that both Hsieh et al. 2011] and Yuan et al.| |2011| choose Bk to be the Hessian of L(w) 
in which case the objective function of ([2]) will be simply composed of the second-order 
Taylor expansion of L(w) and the t\ term A||^ + 

An active-set method maintains a set of indices A that iteratively estimates the optimal 
active set A* which contains indices of zero entries in the optimal solution w* of 



A* = {i e V | (w*)i = 0} 



(3) 



where V = {l,2,...,p}. We use Ak to denote the set A at fc-th iteration. For those 
coordinates i E Ak, we fix its corresponding entry in the current iteration {wk)% such 
that it does not change from current iteration to the next (wk)i = (wk+i)i. That is, 
equivalently, to say that the descent step along that coordinate stays zero (dk)i = 0. 
Adding the active-set strategy to our descent step computation ([2]), we thus obtain 



dk = argmin{VL^ d + d Bkd + A||^ + 

d 

s.t. B k y 0,di = 0, Mi e A k } 



(4) 



Next we are going to discuss the particular choice we make in selecting the positive definite 
matrix Bk and Ak, or its complement Xk — {i ^ V \ i ^ Ak}, which we refer to in this 
paper as the working set. 



2.1 Low- Rank Hessian Approximation Bk 

We make use of Theorem [l] from Nocedal and Wright |2006 , which gives a specific form 
of the low-rank Hessian estimates, which we denote by Bk. Bk is essentially a low-rank 
approximation of the Hessian of L{w) through the well-known limited- memory BFGS 
method, which allows the capture of the curvature information to help achieve a faster 
local convergence. 

Theorem 1. Let Bq be symmetric and positive definite, and assume that the k vector 
pairs {si, U}^~q satisfy sfti > 0, Si = Wi+i — W{ and t{ — VL^+i — VL^. Let Bk be obtained 
by applying k BFGS updates with these vector pairs to Bq, using the formula: 



Bh — B t 



k-i 



Bk-iSk-iSk-i B k-i Vk-iVk-i 



8 k _iBk-lSk-l 



Vk-i s k-i 



(5) 



We then have that 



Bk = B - [B Sk T k ] 



S^BoSk 


Lk 


-l 




_ L k 






. n _ 



(6) 
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where S k and Tk are the p x k matrices defined by 
u>/iz/e Lfc and are t/ie fc x k matrices 



k)i,j 



ifi>j 
) otherwise, 

D k = diag[sg t , sl_ 1 t k - 1 ]. 



(7) 

(8) 
(9) 



For large-scale problems, BFGS method is often used in the limited-memory setting, 
known as the L-BFGS method Nocedal and Wright 2006] . Note that the matrices Sk and 



Tk are augmented by one column every iteration according to Q, and the update Q will 
soon become inefficient if all the previous pairs {s^,^} used. 

In the limited-memory setting, we maintain the set {si^ti} with the m most recent 
correction pairs by removing the oldest pair and adding a newly generated pair. Hence 
when the number of iteration k exceeds m, the representation of the matrices Sk, Tk, Lk, Dk 
needs to be slightly modified to reflect the changing nature of {s^,^}, and Sk,Tk are 
maintained as the p x m matrices. Also, Lk and Dk are now the m x m matrices. 

In the L-BFGS algorithm, the basic matrix Bq may vary from iteration to iteration. A 

(k) 

popular choice in practice is to set the basic matrix at fc-th iteration to Bq = 7^/, where 



lk = 



H-l^k- 
tk-l S k- 



(10) 



which is proved effective in ensuring that the search direction is well-scaled so that less 

(k) 

time is spent on line search. With this particular choice of Bq , we define Q and R to be 



Q = [ikSk T k ] 

sls k 



R 



T T 



and the formula to update B k becomes 

B k = Ikl - QRQ T 

= Ikl - QQ with Q 



RQ 



(11) 
(12) 

(13) 
(14) 



where 7^ is given by (10). Hence, the work to compute Bk only requires the matrix Q 



which is a collection of previous iterates and gradient differences, and that is a 2m 
by 2m nonsingular (as long as sfti > Vi) matrix whose inverse is thus easy to compute 
given m is a small constant. 

2.2 Greedy Active-set Selection Ak(Tk) 

An obvious choice of lk would be lk = {1, •••,£>}, taking into account all the coordinates 
in subproblem minimization. But as we said earlier, this can be inefficient because there 



will potentially be many non-progressive steps where z in (|24j) ends up being zero. For 
example, if d 



0, (Bkd)j turns into zero and (24) becomes 

if (VLfc)j - A > (B k )jj(w k )j, 
otherwise. 



~( B k)jj 

~(wk)j 



(15) 



5 



which is equal to zero if the jth entry of the subgradient dF, defined below, is zero 



(9F(w k )) 3 



( (VL(w k ))j + X 
[max{\V L(w k ))j\ 



(w k )j > 0, 
(w k )j < 0, 



(16) 



A,0} 







Hence, the possibly worst choice of I k would be letting I k — {i 6 V \ (dF(w k ))i — 0}, 
which will result in nothing but all zeros in coordinate descent update. 
Let us now consider two potentially good choice of X k 



r(l) 



{i € V | (dF(w k ))i + 0} 



used by Yuan et al. |2011| and Hsieh et ah] |2011| and 



.(2) 



{i € V | (w k )i ± 0} 



(17) 



(18) 



considered by Wright 



2012 



and 



Olsen et al. 



2012 



Note that x[}^ D XY } \ because in 



(2) 



practice (dF(wk))i can only be exactly zero if {wk)% = according to (16), so we have that 



if (wk)i 7^ then (dF(w k ))i ^ 0. Also note that the two sets will both converge to the 



optimal active set X k 



(1) _ <r(2) 



X* when the correct non-zeros have been identified in Wk 



and Wk is close enough to the optimal w*, because in which case the violations in the dual 
constraints of those zero entries in Wk will be zero such that = {i <EV \ (dF(wk))i = 

(2) 

0} = A k = {i G V | (wk)i = 0}. The above two observations can also be understood by 



representing 1^ as the union of two sets F^ a} and T^ U) , defined by 



(la) 



r (la) 

r(lb) 



{ieV | {dF{w k ))i ± and (w k ) t + 0} 



l£ 0) = {i£V | (dF(w k ))i + and [w k )i = 0} 



(19) 
(20) 



and we have that 1^ = 1 y ^ a> in general, and that 1^' = 1^' only when X y k 

Now let us introduce our rule to select the working set X k . Particularly, Let us use 
i :— X^\i) to denote the coordinate i € X^ that has the ith largest violations {(dFk)^ in 
the dual constraints. We then choose the working set for fc-th iteration by 



r(lo) 



(2) 



r(l) 



.(16) 



r(2) 



1=1 



where Sk is a small integer chosen as a fraction of Note that our X k is largely 

on whose size, as discussed above, is smaller than ijp as used by 



(21) 



based 



Yuan et al. 



2011 



and Hsieh et al. |2011| . This, of course, enables us to solve a smaller subproblem (|4j). 



(2) 

However, we also note that choosing X k — X k in our case is a bad idea because it does 
not allow zero elements of w to become nonzero, so the method may fail to converge. To 
ensure convergence, we have to let every coordinate enter or leave our working set 
which is one purpose of the union of the set (Ji=i ^* 



2.3 Line Search and Convergence Analysis 

After the step d k is computed, a line search procedure needs to be employed in order for 
the convergence analysis to follow from the framework by Tseng and Yun (2009 1. In this 



6 



work, we adapt the Armijo rule, choosing the step size a k to be the largest element from 
{/S ,/^ 2 ,...} satisfying 

F(w k + a k d k ) < F{w k ) + a k aA k (22) 

where < /? < 1, < a < 1, and A k := VL^d k + \\\w k + d k \\i — X\\w k \\i. The convergence 
analysis from Tseng and Yun |2009| also requires the positive deflniteness of the matrix 



B k . Since we obtain B k using LBFGS update, it is guaranteed to be positive definite as 
long as the product sfu > 0. We note that when L(w) in Q is strongly convex, then 
sfti > holds at any two points, otherwise if L(w) might be non-convex, then damped 



BFGS updating needs to be employed Nocedal and Wright 2006 . 



Algorithm 2.1 LHAC: Low rank Hessian Approximation in active-set Coordinate 

DESCENT 



1 

2 
3 
4 

5: 

6 

7: 

8: 

9 

10 
11 
12 
13 
14 
15 
16 
17 

18: 

19 
20 
21 
22 
23 



Choose an initial iterate wo, the LBFGS parameter m, the working set parameter s 
Set 5, T, D, L to empty matrix 
Set wi «— w , 71 «— 1, Q ^— 0, Q «— 
Set the iteration counter k ^— 1 
Compute the gradients VL k and dF k 
while optimality test returns false do 
Set d k <- and d <- 

Compute the working set X& — X^ U |J? = : 
for each coordinate j E X& do 
Compute (Bjfe)^- = J k - qjQj 
end for 

for each coordinate j E X& do 
Compute (B k d)j = ^ k dj — qf d 



Compute z according to (24) 
Update d k ^— d k + zej, d ^— d + zQj 
end for 



Compute the step size a k according to ( 22 ) 

Set W k +1 <~ W k + ttfcdfc 

Compute the gradients VL^+i and dF k +i 
Set tfc <- VL fc+ i - VL fc and s k <- w k+1 - w k 
Update S,T,D,L according to Theorem [I] 
Compute Q, Q according to Theorem [l] 
end while 



3 The inner problem 

At /c-th iteration given the current iterate w kl we apply coordinate descent method to the 
piecewise quadratic subproblem Q to obtain the direction d k . Suppose jth coordinate in 
d is updated, hence d r — d + zej (ej is the j-th vector of the identity). Then z is obtained 
by solving the following one-dimensional problem 

mm^B^jjZ 2 + ((VLjfe)j + (B k d)j)z 

+ X\(w k ) j + (d) j + z\ (23) 
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which is well known to have the following closed-form solution that can be obtained by 
one soft-thresholding step Donoho |1995| , [Hsieh et al| [201 1] : 



z = -c + S(c - b/a, A/a) (24) 

with a, 6, c chosen to be a = (£>&)jj, b = (VL^)j + (B k d)j and c = (w k )j + (d)j- 

As mentioned above, the special low-rank update of B k provides us an opportunity to 
accelerate the coordinate descent process. To see how, let us recall that B k = 7fc/ — QQ, 
where Q e W x2m , Q e R 2mx P and m chosen as a small constant. Clearly we do not 
need to explicitly store or compute B k . Instead, since B k is only used through (B k )u and 
(Bkd)i when applying soft-thresholding steps to updating each coordinate z, we can only 
store the diagonal elements of B k and compute (B k d)i on the fly whenever it is needed. 
Specifically, 

(B k )u = 7*. - qjqi (25) 

where qi is the zth row of the matrix Q and qi is the zth column vector of the matrix Q. 
To compute (B k d)i, we maintain a 2m dimensional vector d := Qd, then 

(B k d) t = lk d x - qjd (26) 

which takes 0(2m) flops, instead of 0(p) if we multiply d by the zth row of B k . Notice 
though, that after taking the soft-thresholding step d has to be updated each time by 
d ^— d + Zify. This update requires little effort given that qi is a vector with 2m dimensions 
where m is often chosen to be between 3 and 20. However, we need to use extra memory 
for caching Q and d which takes 0(2mp + 2m) space. With the other 0(2p + 2mp) space 
for storing the diagonal of B kl Q and d, altogether we need 0(4mp + 2p + 2m) space, 
which can be written as 0(4mp) when p ^> m. 

4 Experiments 

4.1 Sparse Logistic Regression 

The objective function of sparse logistic regression is given by 

1 N 

F{w) = A|H|i + — ^log(l + exp(-y n • ^ T x n )) 

n=l 

where L{w) — ^ X^^Li l°g(l + exp(— y n • w T x n )) is the average logistic loss function and 
{(xni Vn)}n=i £ (I^ p x { — 1 5 1}) is the training set. The number of instances in the training 
set and the number of features are denoted by N and p respectively. Note that the 
evaluation of F requires 0(pN) flops and to compute the Hessian requires 0(Np 2 ) flops. 
Hence, we choose such training sets for our experiment that N and p are large enough 
to test the scalability of the algorithms and yet small enough to be able to run on a 
workstation. 

We test the algorithms on both sparse and dense data sets. The statistics of those data 
sets used in the experiments are summarized in Table [l] In particular, one data set, RCV1 
Lewis et al] |2004| , is a text categorization test collection made available from Reuters Cor- 
pus Volume I (RCV1), an archive of over 800,000 manually categorized newswire stories. 
Each RCV1 document used in our experiment has been tokenized, stopworded, stemmed 
and represented by 47,236 features in a final form as weighted vectors. Another data set is 
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GISETTE, originally a handwriting digit recognition problem from NIPS 2003 feature se- 
lection challenge. Its feature set of size 5000 has been constructed in order to discriminate 
between two confusable handwritten digits: the four and the nine. 

We compare LHAC with the first order method FISTA |Beck and Teboulle| [2009 



GLMNET in particular, originally proposed by Friedman et al. 2010| and later improved 



and GLMNET |Friedman et~aTJ [20101 , |Yuan et al.| |2011| which uses Hessian information 



specifically for sparse logistic regression by Yuan et al. 2011 has been shown by Yuan 



|et al.| 2010| and |Yuan et aL 2011| to be the state-of-the-art for sparse logistic regression 
(their experiments include algorithm such as OWL-QN |Andrew and Gao 2007], TRON 



r|An 

|Lin and MoT] [1999 ], SCD Shalev-Shwartz and Tewi] [2009] ; BBR |Genkin et alj [2007 
etc.), so we compare LHAC with GLMNET. We implemented all three algorithms, for fair 
comparison, in MATLAB, and all the experiments were executed on AMD Opteron 2.0 
GHz machines with 32G RAM and Linux OS. 

We have generated eight training sets, with different number of training instances, 
from the two data sets RCV1 and GISETTE, four for each one. For RCV1, the training 
size increases from 500 to 2500; for GISETTE, it ranges from 500 to 5000. In all the ex- 
periments, we terminate the algorithm when the current objective subgradient is detected 
to be less than the precision e times the initial subgradient. For each training set, we solve 
the problem twice using each algorithm, one with a large epsilon, e.g., e = 10 -2 , to test an 
algorithm's ability to quickly obtain a useful solution, the other one with a small epsilon, 
e.g., e = 10 -5 , to see whether an algorithm can achieve an accurate solution. We report 
the runtime of each algorithm for all 16 experiments in Table [TJ We also illustrate the 
results in Figure [TJ 

As can be seen from the results, the advantage of LHAC really lies in the fact that it 
absorbs the benefits from both the first order and the second order methods, such that it 
iterates fast, with low per iteration cost, and that it converges fast like other quasi-Newton 
methods, by taking advantage of the objective curvature information. For example, let us 
look at how FISTA compares with GLMNET in Table [TJ which provides a good case to 
study the well-known trade-offs between first order and second order methods. Particu- 
larly, it can be seen that in most cases when e is set to 10 -2 , FISTA takes up much less 
time to terminate than GLMNET. However, FISTA falls short when high precision is de- 
manded, e.g., e set to 10 -5 , in which case GLMNET almost always terminates faster than 
FISTA. Notably, LHAC is able to perform well in both cases. In fact, it is overwhelmingly 
faster than the other two in all but one experiment (the runtime difference between LHAC 
and GLMNET is close when the training size is sufficiently small), which brings us to 
another interesting aspect about LHAC. That is, the larger the size of the training set is, 
the larger the margin becomes between LHAC and GLMNET. We illustrate this observa- 

where the runtime of each algorithm (e = 10" 



tion in Figure 1(c) 



and 



i(d) 



is plotted 

against the training size. Note that the complexity of LHAC scales almost linearly with 
the problem size (with a much flatter rate than FISTA), while that of GLMNET increases 
nonlinear ly. Figure [T(a)J and [T (b) plot the change of objective subgradient against elapsed 
cputime in one experiment. Again, it can be observed that LHAC iterates as efficient as 
FISTA in the beginning, and while FISTA gradually slows down near the optimality, it 
continues to work as GLMNET until reaching the optimality tolerance 10 -5 . 



4.2 Sparse Inverse Covariance Matrix Estimation 



In this section we compare our algorithm LHAC with QUIC by Hsieh et al. 2011 



specialized solver that has been shown by Olsen et al. |2012| and |Hsieh et al. 2011 to be 
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(a) GISETTE dense. (#Instances = 2000). 



(b) RCV1 sparse. (#Instances = 2500). 





(c) GISETTE dense. #Instances from 500 to (d) RCV1 sparse. #Instances from 500 to 2500. 
5000. 



Figure 1: Convergence plots in |l(a) and l(b)| (the y-axes on log scale). Scalability plots 
in[T(c)1and [T(d)| 



the state of the art for sparse covariance selection problem defined below 

min f(X) = -logdetX + tr(SX) + A||X||i (27) 

where the optimization variables are in a matrix form X E MP Xp that is required to be 
positive definite. 

For this experiment, we are interested in comparing the total complexity of the two 
algorithms. CPU time, as often used, will not be a reasonable complexity measure in 
this case, because QUIC is implemented in C++ and LHAC - in MATLAB. Instead, we 



decide to count the number of flops required by each algorithm to solve problem (27). 
Here we note that the work required by both algorithm consists mainly two parts - one 
for solving the subproblems and the other for computing Cholesky factorizations of X 
during line search, and that the time either algorithm spends on the first part - solving 
the subproblems - generally takes around 95 ~ 99% of the total elapsed time, as observed 
in the experiments. Hence we focus our comparisons on the first part of the work - the 
one required by applying coordinate descent to solving subproblems, and we note that 
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Table 1: CPU time comparisons on real world classification data sets. Data set rcvl has 
47,326 features and 0.17% non-zeros; Data set gisette has 5,000 features and 99% non- 
zeros, rcvl 500 denotes that the training set is rcvl and 500 training samples are used, e 
indicates the optimization tolerance on the objective subgradient. The results show that 
LHAC is significantly faster than other methods. 



Data set e 


cputime(in seconds) 
fista glmnet lhac 


. 10 
rcvl 50 o 10 -5 


4.54 o.39 2.68 
358.72 12.04 8.09 


. 10 

rcvl 1000 10 - 5 


14.28 6.18 3.53 
572.41 18.13 14.13 


i n— 2 

r-cvl 2 ooo 1Q -5 


15. zo 97.87 o.lo 
980.72 188.48 33.05 


io- 2 

ra;l 2 5oo iq-5 


17.71 170.13 9.95 
1212.92 351.60 52.00 


■ ++ 10-2 


52.91 10.93 3.23 
1009.66 35.95 54.90 


• „ 10 " 2 
gisette 2 ooo 10 - 5 


85.41 269.26 35.02 
1686.69 1195.48 229.67 


gisette 3 ooo |q-5 


101.52 507.65 40.17 
2241.95 2021.40 332.83 


io- 2 

gisette 5000 1Q _ 5 


102.78 1758.98 112.11 
2661.95 5935.72 752.87 



although LHAC generally requires more work in line search than QUIC (due to more 
outer iterations), it adds little to the total work for reasons stated above. 

Now let us describe how we compute the total complexity. Let us use k q to denote 
the number of outer iteration and k\ to denote the number of inner coordinate sweep. Let 
Tqd stand for the number of flops one coordinate descent step takes. Then we define the 
total complexity by 

Complexity — Ko^ipTcD (28) 

In theory, the two algorithms have similar k\ because they both apply coordinate descent 
to a lasso subproblem, but different n Q and Tqd- Particularly, n Q of QUIC is smaller than 
that of LHAC whereas Tqd of LHAC is smaller than that of QUIC. The reason is that 
QUIC uses the actual Hessian matrix of the smooth part tr(*SX) — logdetX rather than 
the low-rank Hessian approximation as in LHAC, which results in a different convergence 
rate (kq) and also a different complexity (Tcd) in computing the coordinate descent step 
( [24] ). As we discussed earlier, the special structure of the low-rank matrix enables LHAC 
to accelerate that step to O(m) flops with m a constant number (chosen as 10 for this 
experiment). Whereas in the case of QUIC, one coordinate step takes problem-dependent 
complexity 0(p) where p is the dimension of X, but it achieves quadratic convergence 
when close to the optimality. The above observations make it interesting to see in practice 



how the Complexity, denned in (28), will compare for the two algorithms since it is not 



obvious which one is better through theoretical analysis alone. 
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Figure 2: Comparing two strategies - the strategy used by QUIC and the one used by 
LHAC - of choosing the working set on ER data set. Note that the working set maintained 
by LHAC approaches the optimal size from below while that of QUIC from above. 



The results are presented in Table [2j and in Figure [2] where we compare our active-set 
strategy with the one used by QUIC. The way we compute the Complexity is by counting 
the number of flops in the coordinate descent step in each algorithm. For QUIC, we use 
their C++ implementation Hsieh et al. 201 1| and add a counter directly in their code; 
for LHAC, we use our own MATLAB implementation. We report the results on real 
world data from gene expression networks preprocessed by Li and Toh 2010|. We set the 
regularization parameter A = 0.5 for all the experiments as sugg ested in |Li and Toh| [2010[ 



Similarly to the sparse logistic regression experiments, we solve each problem twice with 
different precision e = 10 -2 and e = 10 -6 . Note that in all experiments QUIC consistently 
requires more flops to solve the problem than LHAC does. In one case where low precision 
is used it takes QUIC nearly three times more flops; even in the case QUIC is best at, 
where high precision is demanded, it can take QUIC 1.5 times more flops to achieve the 
same precision as LHAC did. 



5 Conclusions 

We have presented a general algorithm LHAC for efficiently using second-order informa- 
tion in training large-scale ^-regularized convex models. We tested the algorithm on two 
instances of sparse logistic regression and five instances of sparse inverse covariance se- 
lection, and found that the algorithm is faster (sometimes overwhelmingly) than other 
specialized solvers on both models. The efficiency gains are due to two factors: the ex- 
ploitation of the special structure present in the low-rank Hessian approximation matrix, 



12 



Table 2: Total complexity comparisons, p stands for dimension, e indicates the optimiza- 
tion tolerance on the objective subgradient. 



Data set p e 


Complexity( x 10 6 ) 

QUIC LHAC 


1(T 2 

Lymph 587 _ fi 
10 D 


47 13 
105 45 


io~ 2 

ER 692 


218 78 
353 217 


10" 2 

Arabidopsis 834 


559 329 
1001 804 


Leukemia 1255 


2101 574 
3028 132G 


10~ 2 

Hereditary 1869 


16558 8333 
18519 16770 



and the greedy active-set strategy, which correctly identifies the non-zero features of the 
optimal solution. 
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